11 research outputs found

    Low-Complexity Nonparametric Bayesian Online Prediction with Universal Guarantees

    Full text link
We propose a novel nonparametric online predictor for discrete labels conditioned on multivariate continuous features. The predictor is based on a feature-space discretization induced by a full-fledged k-d tree with randomly picked directions and on a recursive Bayesian distribution, which automatically learns the most relevant feature scales characterizing the conditional distribution. We prove its pointwise universality, i.e., it achieves a normalized log-loss performance asymptotically as good as the true conditional entropy of the labels given the features. The time complexity to process the n-th sample point is O(log n) in probability with respect to the distribution generating the data points, whereas other exact nonparametric methods require processing all past observations. Experiments on challenging datasets show the computational and statistical efficiency of our algorithm in comparison to standard and state-of-the-art methods. Comment: Camera-ready version published in NeurIPS 201
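The discretization described above can be sketched in a few lines. The following is a hypothetical, simplified illustration (all names are invented): each sample descends a k-d tree whose split directions are chosen at random, and a Laplace-smoothed label count at the reached leaf gives the prediction. The paper's actual predictor additionally mixes the estimators along the root-to-leaf path with a recursive Bayesian distribution, which this sketch omits.

```python
import random

class Node:
    """One cell of the feature-space discretization."""
    def __init__(self, dim):
        self.dim = random.randrange(dim)   # randomly picked split direction
        self.threshold = None              # fixed when the cell first splits
        self.counts = {}                   # label -> count observed in this cell
        self.left = self.right = None

class RandomKDPredictor:
    """Hypothetical sketch of online label prediction via a randomized k-d tree.

    Each sample descends the tree in O(depth) steps; a Laplace estimator at
    the reached leaf predicts the label distribution."""
    def __init__(self, dim, labels):
        self.dim, self.labels = dim, labels
        self.root = Node(dim)

    def _descend(self, x):
        node = self.root
        while node.threshold is not None:
            side = 'left' if x[node.dim] <= node.threshold else 'right'
            nxt = getattr(node, side)
            if nxt is None:                # lazily create the child cell
                nxt = Node(self.dim)
                setattr(node, side, nxt)
            node = nxt
        return node

    def predict(self, x):
        """Laplace-smoothed label distribution at the leaf containing x."""
        node = self._descend(x)
        total = sum(node.counts.values()) + len(self.labels)
        return {y: (node.counts.get(y, 0) + 1) / total for y in self.labels}

    def update(self, x, y):
        node = self._descend(x)
        node.counts[y] = node.counts.get(y, 0) + 1
        if node.threshold is None:         # split the leaf at this sample
            node.threshold = x[node.dim]
```

Because each point splits exactly one leaf, the expected depth reached by a fresh sample stays logarithmic under mild conditions, which is the intuition behind the O(log n) per-point cost.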

    PALSE: Python Analysis of Large Scale (Computer) Experiments

    Get PDF
A tenet of science is the ability to reproduce results, and a related issue is the possibility of archiving and interpreting the raw results of (computer) experiments. This paper presents an elementary Python framework addressing this latter goal. Consider a computing pipeline consisting of raw data generation, raw data parsing, and data analysis, i.e., graphical and statistical analysis. palse addresses these last two steps by leveraging the hierarchical structure of XML documents. More precisely, assume that the raw results of a program are stored in XML format, possibly generated by the serialization mechanism of the Boost C++ libraries. For raw data parsing, palse imports the raw data as XML documents, and exploits the tree structure of the XML together with the XML Path Language to access and select specific values. For graphical and statistical analysis, palse gives direct access to ScientificPython, R, and gnuplot. In a nutshell, palse combines standard languages (Python, XML, XML Path Language) and tools (Boost serialization, ScientificPython, R, gnuplot) in such a way that, once the raw data have been generated, graphical plots and statistical analysis require just a handful of lines of Python code. The framework applies to virtually any type of data, and may find a broad class of applications.
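The parsing step described above can be illustrated without palse itself. This stdlib-only sketch (the XML layout is invented for illustration) shows the core idea: treat the raw results as an XML tree and select specific values with XPath-style expressions, after which analysis is a few lines of Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw results of an experiment, as a serialized XML document.
raw = """
<experiment>
  <run id="1"><runtime>2.31</runtime><error>0.05</error></run>
  <run id="2"><runtime>2.47</runtime><error>0.04</error></run>
</experiment>
"""

tree = ET.fromstring(raw)
# XPath-style selection of specific values from the result tree:
runtimes = [float(e.text) for e in tree.findall("./run/runtime")]
mean_runtime = sum(runtimes) / len(runtimes)
```

Note that `xml.etree.ElementTree` supports only a subset of XPath; for the full XML Path Language one would use a library such as lxml.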

    Nonparametric methods for learning and detecting multivariate statistical dissimilarity

    No full text
In this thesis, we study problems related to learning and detecting multivariate statistical dissimilarity, which are of paramount importance for many statistical learning methods nowadays used in an increasing number of fields. This thesis makes three contributions related to these problems. The first contribution introduces a notion of multivariate nonparametric effect size shedding light on the nature of the dissimilarity detected between two datasets. Our two-step method first decomposes a dissimilarity measure (the Jensen-Shannon divergence) so as to localize the dissimilarity in the data embedding space, and then aggregates points of high discrepancy and in spatial proximity into clusters. The second contribution presents the first sequential nonparametric two-sample test: instead of being given two sets of observations of fixed size, observations are treated one at a time and, once sufficiently strong evidence has been found, the test can be stopped, yielding a more flexible procedure while keeping guaranteed type I error control. Additionally, under certain conditions, the probability of a type II error vanishes as the number of observations tends to infinity. The third contribution consists of a sequential change-detection test based on two sliding windows on which a two-sample test is performed, with type I error guarantees. Our test has a controlled memory footprint and, as opposed to state-of-the-art methods that also provide type I error control, constant time complexity per observation, which makes it suitable for streaming data.

    Méthodes non-paramétriques pour l'apprentissage et la détection de dissimilarité statistique multivariée

    Get PDF
In this thesis, we study problems related to learning and detecting multivariate statistical dissimilarity, which are of paramount importance for many statistical learning methods nowadays used in an increasing number of fields. This thesis makes three contributions related to these problems. The first contribution introduces a notion of multivariate nonparametric effect size shedding light on the nature of the dissimilarity detected between two datasets. Our two-step method first decomposes a dissimilarity measure (the Jensen-Shannon divergence) so as to localize the dissimilarity in the data embedding space, and then aggregates points of high discrepancy and in spatial proximity into clusters. The second contribution presents the first sequential nonparametric two-sample test: instead of being given two sets of observations of fixed size, observations are treated one at a time and, once sufficiently strong evidence has been found, the test can be stopped, yielding a more flexible procedure while keeping guaranteed type I error control. Additionally, under certain conditions, the probability of a type II error vanishes as the number of observations tends to infinity. The third contribution consists of a sequential change-detection test based on two sliding windows on which a two-sample test is performed, with type I error guarantees. Our test has a controlled memory footprint and, as opposed to state-of-the-art methods that also provide type I error control, constant time complexity per observation, which makes it suitable for streaming data.

    Beyond Two-sample-tests: Localizing Data Discrepancies in High-dimensional Spaces

    Get PDF
Comparing two sets of multivariate samples is a central problem in data analysis. From a statistical standpoint, the simplest way to perform such a comparison is to resort to a nonparametric two-sample test (TST), which checks whether the two sets can be seen as i.i.d. samples of an identical unknown distribution (the null hypothesis). If the null is rejected, one wishes to identify regions accounting for this difference. This paper presents a two-stage method providing feedback on this difference, based upon a combination of statistical learning (regression) and computational topology methods. Consider two populations, each given as a point cloud in R^d. In the first step, we assign a label to each set and we compute, for each sample point, a discrepancy measure based on comparing an estimate of the conditional probability distribution of the label given a position versus the global unconditional label distribution. In the second step, we study the height function defined at each point by the aforementioned estimated discrepancy. Topological persistence is used to identify persistent local minima of this height function, their basins defining regions of points with high discrepancy and in spatial proximity. Experiments are reported both on synthetic and real data (satellite images and handwritten digit images), ranging in dimension from d = 2 to d = 784, illustrating the ability of our method to localize discrepancies. On a general perspective, the ability to provide feedback downstream of a TST may prove of ubiquitous interest in exploratory statistics and data science.
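The first stage described above can be sketched concretely. This is a simplified, hypothetical variant (function name and the k-NN estimator are illustrative choices, not the paper's exact regressor): estimate the conditional label distribution at each point from its k nearest neighbours and score how far it sits from the global label distribution. The second stage, persistence-based clustering of the resulting height function, is omitted.

```python
import numpy as np

def knn_discrepancy(X, labels, k=5):
    """For each point, compare the local label distribution (over its k
    nearest neighbours, including itself) with the global unconditional one
    via a smoothed KL-style score. Sketch of the first stage only."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    global_p = np.array([(labels == c).mean() for c in classes])
    # All pairwise squared distances (fine for a sketch; use a k-d tree at scale).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    scores = np.empty(len(X))
    eps = 1e-9  # smoothing so the log is always defined
    for i in range(len(X)):
        nn = np.argsort(d2[i])[:k]   # indices of the k nearest points
        local_p = np.array([(labels[nn] == c).mean() for c in classes])
        # Large where the local label mix differs from the global mix.
        scores[i] = np.sum((local_p + eps) * np.log((local_p + eps) / (global_p + eps)))
    return scores
```

On two well-separated point clouds carrying different labels, every point's neighbourhood is label-pure while the global distribution is mixed, so the score is high everywhere; where the clouds overlap and mix, the score drops toward zero.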

    Un Test Non-paramétrique d'Homogénéité Séquentiel

    Get PDF
Given samples from two distributions, a nonparametric two-sample test aims at determining whether the two distributions are equal or not, based on a test statistic. This statistic may be computed on the whole dataset, or may be computed on a subset of the dataset by a function trained on its complement. We propose a third tier, consisting of functions exploiting a sequential framework to learn the differences while incrementally processing the data. Sequential processing naturally allows optional stopping, which makes our test the first truly sequential nonparametric two-sample test. We show that any sequential predictor can be turned into a sequential two-sample test for which a valid p-value can be computed, yielding controlled type I error. We also show that pointwise universal predictors yield consistent tests, which can be built with a nonparametric regressor based on k-nearest neighbors in particular. We also show that mixtures and switch distributions can be used to increase power, while keeping consistency.
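The predictor-to-test reduction can be sketched as follows, under the assumption that under the null each incoming point's source label is a fair coin independent of the features (e.g., balanced random interleaving of the two streams). The interface and the toy bucket predictor below are hypothetical stand-ins for the paper's pointwise universal predictors: the running ratio of the predictor's likelihood to the null's is a nonnegative martingale under the null, so by Ville's inequality its reciprocal is a p-value that remains valid under optional stopping.

```python
import math

def sequential_two_sample_test(stream, predictor, alpha=0.05):
    """Sketch: turn a sequential label predictor into a sequential test.

    `stream` yields (x, y) pairs with source label y in {0, 1};
    `predictor(x)` returns an estimate of P(y = 1 | x, past) and is updated
    after each observation via `predictor.update(x, y)` (hypothetical API)."""
    n, p_value, log_m = 0, 1.0, 0.0   # log_m: log likelihood-ratio martingale
    for n, (x, y) in enumerate(stream, 1):
        q1 = predictor(x)
        q = q1 if y == 1 else 1.0 - q1
        log_m += math.log(max(q, 1e-12)) - math.log(0.5)  # vs. fair-coin null
        predictor.update(x, y)
        p_value = min(1.0, math.exp(-log_m))  # valid at any stopping time
        if p_value <= alpha:                  # optional stopping: reject now
            return "reject", n, p_value
    return "accept", n, p_value

class SignBucketPredictor:
    """Toy predictor for the sketch: Laplace-smoothed label frequencies in
    two feature buckets, x < 0 and x >= 0."""
    def __init__(self):
        self.counts = {False: [1, 1], True: [1, 1]}  # bucket -> [n0, n1]
    def __call__(self, x):
        n0, n1 = self.counts[x >= 0]
        return n1 / (n0 + n1)
    def update(self, x, y):
        self.counts[x >= 0][y] += 1
```

When the label is genuinely predictable from the features (the two samples differ), the predictor beats the fair-coin null, the martingale grows, and the test stops early; under the null no predictor can beat the coin on average, which is what controls the type I error.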

    A Sequential Non-Parametric Multivariate Two-Sample Test

    No full text
    International audience.